Create initial library version #1

M-Fedor · 2021-04-22T08:26:51Z

No description provided.

M-Fedor · 2021-04-22T08:38:56Z

source/input_reader.py

+
+elif platform == 'win32':
+
+    class DirectoryWatcherWindows(InputReaderInterface):


This is just a prototype now. Will be tested properly on Windows platform.

M-Fedor · 2021-04-22T08:40:25Z

source/task_executor.py

+
+    read_id: str
+    filename: str
+    caller: BasecallerInterface


This is a design flaw. It works fine with BasecallerMock, but can harm the performance once we have large class objects to be serailized. I'll rework this.

ppershing · 2021-04-22T18:45:56Z

source/input_reader.py

+        return self.task_batch
+
+
+if platform == 'linux':


given how big are these, I would split them into platform/xyz files so that this file simply re-exports correct watcher for given platform

I was hoping that someone gives reaction to that. Thanks. When located in single module I had a problem that platform specific code could not be interpreted properly. Therefor those if statements. But your solution solves the problem in more elegant way.

ppershing · 2021-04-22T18:48:28Z

source/argument_parser.py

+                    OutputFormatterFastq, SequentialTaskExecutor, ParallelTaskExecutor)
+
+
+class ArgumentParser(ArgumentParserInterface):


I see no reason for this to be class if it's only interaction is create it & then call parse_arguments once. I would make it just a function parse_arguments

Well, it also initializes LibraryComponents class. Instead of returning it all I figured, it would be more understandable, if I designed it as a class. Also, if someone else will want to use that library in a future and he will need to implement his own ArgumentParser, he can simply create a subclass of ArgumentParserInterface and generally just use the existing code infrastructure.

I would suggest using against patterns like subclass. OOP, as done by many, is done wrong. In most cases, functional design fares way better in terms of maintainability, clarity and testability. Perhaps checou out some Sandi Metz presentations, they are usually good inspiration

I agree that this design decision is just silly. But I have a question. I don't like about python libraries that internal code is just mixed with API methods and the only distinction between them is just an underscore at the beginning of a function name. I really liked here that all of the implementation is separated from the library interface and it's clear for a new-comer what the interface looks like and what is the intended usage without spending too much time on trying to understand what the f*** it does. Just look at that:
https://github.com/nanoporetech/ont_fast5_api
It's just tedious piece of code for me to use.

If I use a functional design for this component, is there any preferred pythonic way to declare but not implement functions a thus create pretty interface? Or should I just create a commented document describing the functionality in interface folder? Maybe it's not that necessary in this small component, but I would like to know for a future reference. Thanks!

Can we also discuss this one @ppershing ?

ppershing · 2021-04-22T18:49:55Z

source/argument_parser.py

+
+    def parse_arguments(self) -> Arguments:
+
+        self.parser.add_argument('--directory', type=str, nargs='*',


this is weird that you add arguments to parser only in parse_args

I'm not sure I understand what you mean.

right now, parse_arguments is not idempotent despite it's name suggesting that it does not mutate anything. Try to run it twice and see the fireworks happen ...

Functional design should fix this one.

ppershing · 2021-04-22T18:51:17Z

source/output_formatter.py

+
+class OutputFormatterFastq(OutputFormatterInterface):
+
+    def write_output(self, output_data: List[OutputData]) -> None:


is it List or Sequence?

It is List, could be a Sequence, but in my opinion should be a Tuple. It should be a tuple since returned by TaskExecutor. So it would be enforced by design that those data are immutable.

If the function does not care about whether it is list, tuple, or sequence, it should take the superset of them, i.e. the Sequence. Especially if you don't see the requirement changing in the future. In fact, you might want to pass output data generator to the function instead of passing a list

ppershing · 2021-04-22T18:52:26Z

source/output_formatter.py

+            if len(basecalled_seq) == 0:
+                return
+
+            print(">%s" % read_id, file=self.output_stream)


what-the? How could an "interface" have self.output_stream? 🤔

In python there's not really such strict concept interface. Here I preferred code reuse in baseclass over the interface nature. So OutputFormatterInterface defines public API and also implements som common init() method for all the subclasses. It could be discussed though, what are your ideas?

not really such strict concept interface.

If you name something as interface, it better be interface

Fair enough.

ppershing · 2021-04-26T18:03:46Z

source/argument_parser.py

+        return self.library_components
+
+
+def _initialize_input_reader(arguments: argparse.Namespace) -> InputReaderInterface:


it's unclear why this needs to get whole argspace namespace when it can take bool.
Perpahs more importantly, watch_directory should not be bool but use "store" feature of argparse to store some value to it

But on the other look, at least the interface is consistent with the rest of the helper functions so maybe it is ok as it is

ppershing · 2021-04-26T18:08:09Z

source/output_formatter.py

+
+        for output in output_data:
+
+            (read_id, run_id, read_num, channel_num, start_time, basecalled_seq, quality_scores) = output


this begs for named tuple

output_data is a named tuple. But I was thinking that putting it into print() function diretly from a named tupple like output.read_id would take even more space and be even more ugly. What do you think?

Well, for the starter, this does not check whether you do correct naming at destructuring. So it is easy to make a mistake.
Second, you can just rename output to o and then it would be more natural.
Finally, you can probably use f-strings instead of format interpolation (assuming python 3) though I guess that is up to discussion whether it will make things nicer or not

ppershing · 2021-04-26T18:10:30Z

interface/input_reader.py

+        This method is always called by ArgumentParser and should NOT be
+        called by integrator.
+        """
+        raise NotImplementedError


why not @AbstractMethod ?

ppershing · 2021-04-26T18:14:34Z

interface/output_formatter.py

+
+class OutputFormatterInterface:
+
+    def __init__(self, output_name: str, compressed_output: bool) -> None:


This is exact OOP antipattern.
You should have OutputFormatter instance take "FileLike" argument and let it just do the job of writing to it. Or alternatively, supply "write_output" function to this class. The way it is now, you are mixing two responsibilities in one class hierarchy

ppershing · 2021-04-26T18:14:46Z

interface/output_formatter.py

+        else:
+            self.output_stream = open(output_name, "w")
+
+    def __del__(self) -> None:


del does not work correctly, if you want RAII style, you have to use "with" statement and context managers

ppershing · 2021-04-26T18:15:29Z

interface/task_executor.py

+from .data_types import FilePath, OutputData
+
+@dataclass
+class BasecallerInterface:


How can dataclass be an interface? That make absolutely zero sense

ppershing · 2021-04-26T18:16:13Z

interface/task_executor.py

+    def __init__(self) -> None:
+        self.caller = None
+
+    def set_caller(self, caller: BasecallerInterface) -> None:


shouldn't it rather take caller as constructor param?

Well, I'm not sure this is the best design, but I couldn't do better. caller is an external dependency. This library should be able to utilize any basecaller as long as it implements the BasecallerInterface. While ArgumentParser provides integrator with created and initialized TaskExecutor object, it has no way to know what caller to supply. And I think it shouldn't have. It should be an integrator responsibilty in my opinion to supply a correct caller.

Well, that tells you that your whole design is off and has cross-cutting concerns. For example, you can have execute_task_batch to also take caller as an argument. That way, TaskExecutor deals only with executing in some fashion and does not care what the caller is. You can then "bind" caller to task executor in another level of abstraction

That is true. But once I have caller as parameter of execute_task_batch() I have to serialize the caller object in parallel execution to communicate the caller object to the workers. If it was a data member, function _call_read() could be made a class method in some TaskExecutorBase class other executors would inherit from and caller object could be accessed naturally. (My original plan with this). That way it would be beneficial for performance since serializing large caller objects can be costly.

Hmm, ok, now I see this. But then I don't understand how this works if you set caller after constructing the pool -- once pool is created, I would assume workers are already spawned and it would be too late to assign caller in the parent process

THAT is an excellent question. Haven't thought about it that way.

ppershing · 2021-06-27T14:03:39Z

interface/data_types.py

+from typing import NamedTuple
+
+
+class OutputData(NamedTuple):


given that this has very nasty confusable arguments (string, string, int, string, ....)
I strongly vote against named tuple. It should be data class instead

I've read a few articles and can't figure out what would be the difference. Can you elaborate, please?

Sure. Namedtuple can be destructured using tuple destructuring. Dataclass cannot. This prevents you from accidentally shooting yourself to foot as you always must access through named parameters

ppershing · 2021-06-27T14:04:14Z

interface/data_types.py

+    Arguments can be obtained from ArgumentParser object.
+    """
+
+    watch_directory: bool


confusing name. From the name itself I would expect this to contain string/path

Create initial library version

6d7c614

M-Fedor requested review from ppershing, usamec and tvinar April 22, 2021 08:26

M-Fedor self-assigned this Apr 22, 2021

M-Fedor commented Apr 22, 2021

View reviewed changes

ppershing reviewed Apr 22, 2021

View reviewed changes

ppershing reviewed Apr 26, 2021

View reviewed changes

ppershing reviewed Jun 27, 2021

View reviewed changes


		elif platform == 'win32':

		class DirectoryWatcherWindows(InputReaderInterface):

		OutputFormatterFastq, SequentialTaskExecutor, ParallelTaskExecutor)


		class ArgumentParser(ArgumentParserInterface):


		def parse_arguments(self) -> Arguments:

		self.parser.add_argument('--directory', type=str, nargs='*',


		class OutputFormatterFastq(OutputFormatterInterface):

		def write_output(self, output_data: List[OutputData]) -> None:

		return self.library_components


		def _initialize_input_reader(arguments: argparse.Namespace) -> InputReaderInterface:


		for output in output_data:

		(read_id, run_id, read_num, channel_num, start_time, basecalled_seq, quality_scores) = output


		class OutputFormatterInterface:

		def __init__(self, output_name: str, compressed_output: bool) -> None:

Create initial library version #1

Are you sure you want to change the base?

Create initial library version #1

Conversation

M-Fedor commented Apr 22, 2021

Choose a reason for hiding this comment

M-Fedor Apr 22, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

M-Fedor Apr 24, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ppershing Apr 26, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

M-Fedor Apr 22, 2021 •

edited

Loading

M-Fedor Apr 24, 2021 •

edited

Loading

ppershing Apr 26, 2021 •

edited

Loading